Sdd/dev#1
Open
sdevare-nv wants to merge 21 commits into
Open
Conversation
The opencode CLI registers TUI subcommands (Attach, TuiThread) via eager imports in index.ts. Loading them transitively imports cli/cmd/tui/app.tsx, whose JSX is meant to compile against @opentui/solid (per the package tsconfig's jsxImportSource) but is incorrectly resolved against react/jsx-dev-runtime when Bun runs the un-bundled .ts file at runtime — react isn't a dep of packages/opencode, so the bench harness crashes at startup before the run command ever executes. The bench harness (packages/opencode/src/bench/cli.ts) only ever spawns \`bun src/index.ts run\` as a subprocess; it never invokes the TUI. Remove the imports + command registrations entirely so the cli/cmd/tui/ subtree is unreachable from the bench code path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Running opencode's CLI un-bundled (`bun src/index.ts run`) triggers cascading runtime-resolution failures: tsconfig's `jsxImportSource` isn't honored for inline `.tsx` JIT compilation (we already removed the TUI commands to dodge that), and bun's isolated install layout under `node_modules/.bun/<pkg>@<ver>/...` breaks `..`-relative `.mjs` imports inside packages like @anthropic-ai/sdk (the `internal/to-file.mjs` import from `core/uploads.mjs` fails to traverse the symlink the way Node does at runtime). opencode is meant to be shipped as a pre-bundled single file (their own `bin/opencode` is built the same way) — `bun build` resolves every transitive import statically and inlines the result, so runtime never has to. The companion gym `setup_scripts/opencode.sh` now invokes `bun build packages/opencode/src/index.ts --outdir .bench-build --entry-naming opencode.js`. This commit teaches `bench/cli.ts` to prefer that bundle when it exists, and falls back to `src/index.ts` for dev / when the build step hasn't run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`provider/models.ts:137` does `import("./models-snapshot.js")` for a
static cache of models.dev metadata. Upstream generates this file at
build time via `script/generate.ts` (network fetch from models.dev),
and `.gitignore` correctly excludes it as a build artifact.
For the bench harness we register a single custom provider
(`@opencode-ai/nemo-gym`) in the per-instance opencode config and
never consult the snapshot. But `bun build --target=bun
packages/opencode/src/index.ts ...` still needs the import target to
exist at static-analysis time, otherwise it errors with:
error: Could not resolve: "./models-snapshot.js"
Force-add empty stubs so the bundle build succeeds without running
`script/generate.ts` (which needs network + adds 5+ seconds). The
runtime `try:` lambda in models.ts handles an empty snapshot fine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…onfig
`PermissionActionConfig` accepts a glob-keyed map only for file/shell-style
permissions (edit, bash). webfetch / websearch use a different shape (a
single literal action), so emitting `{"*": "deny"}` for them trips the
config validator at startup:
Error: Configuration is invalid at .../opencode.jsonc
↳ Expected PermissionActionConfig | undefined, got {"*":"deny"}
agent.swe-bench.permission.webfetch
↳ Expected PermissionActionConfig | undefined, got {"*":"deny"}
agent.swe-bench.permission.websearch
Both tools are already disabled via the agent's `tools:` block
(webfetch: false, websearch: false), so the permission entries are
redundant. Removing them lets the per-instance config load cleanly and
the run command actually start.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
opencode's bundled-provider loader invokes the factory with
\`{ name, ...options }\` where `options` is the merged user + custom
provider defaults from `provider.ts`. For some providers opencode
injects an empty/static `headers: {}` into that merged options dict
(distinct from upstream openai-compatible's schema where `headers` is a
function `() => Record<string, string | undefined>`).
We were unconditionally invoking `this.cfg.headers?.()`. The optional
chain only short-circuits on null/undefined, so when opencode passed a
plain object it threw at session-start:
Error: this.cfg.headers is not a function.
(In 'this.cfg.headers?.()', 'this.cfg.headers' is an instance of Object)
Branch on `typeof headers` to support both shapes: call it if it's a
function, spread it directly if it's a plain object. Either way the
resulting kv pairs are merged into the per-request HTTP headers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The constructor merged caller config over defaults via:
this.cfg = { requestTimeoutMs: 600_000, retries: 3, ...cfg }
opencode's provider loader invokes the factory with options that
include explicit `undefined` for optional fields. Our factory in
`provider/sdk/nemo-gym/index.ts` faithfully passes those through:
new NemoGymLanguageModel(modelId, {
...,
requestTimeoutMs: opts.requestTimeoutMs, // undefined
retries: opts.retries, // undefined
})
The `...cfg` spread then overwrites the 600_000 / 3 defaults with
`undefined`. At runtime `setTimeout(fn, undefined)` is treated as
`setTimeout(fn, 0)` — so the abort timer fires before `fetch` can
even hand off the request. Result:
NeMoGym chat completions failed after 3 attempts:
AbortError: The operation was aborted.
with ~10 ms total elapsed between step_start and the error event.
Swap to the spread-first, coalesce-with-`??` pattern so undefineds
fall through to the defaults instead of clobbering them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the input row's responses_create_params.model is unset, the gym
side hands opencode an empty model string. opencode's session/provider
resolver then falls back to its sentinel `"default"` (or threads the
empty string straight through). We were faithfully POSTing
\`model: "default"\` to the gym openai_model server, which forwarded it
to OpenAI:
The model `default` does not exist or you do not have access to it.
The gym openai_model server already has the right pattern for this:
body_dict.setdefault("model", self.config.openai_model)
so omitting `model` from the outbound body lets the server fill in the
policy-configured model name (e.g. gpt-4.1-2025-04-14) without any
client-side knowledge of what the policy actually points to. Guard the
model assignment in `_buildRequestParams` to do exactly that.
The companion gym-side change is in OpenCodeHarnessProcessor.get_run_command:
when body.model is empty, fall back to the resolved
model_server_cfg.openai_model (or .model for vllm) before writing the
per-instance config, so opencode also sees a real string when it's
populating the agent's model field.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The todowrite tool is just an in-session planning aid — explicit multi-step task lists help the model decompose non-trivial fixes without affecting token-ID contiguity or subagent dispatch. Was disabled defensively; flipping it on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Strip git history past base_commit so the agent can't reach future commits via `git checkout <future-sha>` / `git log` exploration. Two-pass: careful per-ref iteration first; nuclear batch-delete fallback for monorepos with thousands of refs (e.g. datadog-agent with 5k+ release tags would time out the careful pass). Runs once in cli.ts before spawning the opencode session, scoped to the per-instance workspace, using the base_commit field from the instance JSONL. `|| true` at the tail so a busted git state can't kill the rollout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Mark each phase + each ref delete/reset with a `[deep_reset:careful]` or `[deep_reset:nuclear]` line so the agent log shows what happened. Mirrors the per-step `echo` markers in nv-OpenHands' run_infer.py deep-reset path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
- language-model.ts: read sessionID/parentSessionID from `x-session-affinity` / `x-parent-session-id` request headers (set by opencode's session/llm.ts for non-opencode providers). Per-session turn counter prevents subagent dumps from clobbering the main session's. Filename includes sessionID so each session's per-turn JSONs sit side-by-side; payload now carries session_id, parent_session_id, and turn for downstream reconstruction of the agent tree. - cli.ts: new --enable-subagents flag; toggles `task: <bool>` in the per-instance opencode.jsonc. Default off. - run_infer.sh: forwards ENABLE_SUBAGENTS env -> --enable-subagents flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Replaces the 12-line bench-local DEFAULT_SYSTEM_PROMPT with the real anthropic system prompt opencode ships (`session/prompt/anthropic.txt`, ~105 lines), plus a short SWE-bench addendum (don't commit/format the diff, harness captures git diff as the patch, don't edit tests). `--system-prompt <path>` override still wins when supplied. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
The env block in session/system.ts (cwd, worktree, platform, today's date) is appended to the system prompt every turn. The `Today's date` field uses `new Date().toDateString()` — across a midnight rollover the system message shifts, which breaks the RL prompt-token-prefix invariant (turn[N+1].prompt_token_ids must extend turn[N]'s). Bench mode sets OPENCODE_DISABLE_ENV_PROMPT=1 in the child env so the environment() effect short-circuits to []. Normal `opencode run` is unaffected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Some apptainer images ENOENT on bare program names through Bun's
posix_spawn (libuv falls back to PATH-less execv on some kernels /
musl builds). Symptoms in the wild:
[bench] deep_reset spawn error: ENOENT posix_spawn 'bash'
[bench] wrote ... error=opencode_exit_999 # bun spawn also ENOENT
Fixes:
- deep_reset.ts: probe /bin/bash, /usr/bin/bash, /bin/sh, /usr/bin/sh
and spawn the first that exists. The script is POSIX-compatible so
/bin/sh works fine when bash is absent (minimal/distroless SIFs).
Skip deep_reset with a warning if no shell is found.
- cli.ts runOpencode: use process.execPath (the currently-running bun
binary) instead of bare "bun".
- cli.ts captureGitDiff: probe /usr/bin/git, /bin/git, /usr/local/bin/git
with bare "git" as last-ditch fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
The previous fix moved to absolute binary paths but `/bin/bash` still
ENOENTs through Bun's posix_spawn whenever the spawn passes a `cwd`
option on minimal apptainer images (e.g., SWE-bench's astropy SIF).
Root cause: libc lacks `posix_spawn_file_actions_addchdir_np`, so
libuv's cwd-handling fallback fails the spawn outright instead of
falling back to fork+chdir+exec.
Workaround: don't pass `cwd` to spawn at all.
- deep_reset.ts: prepend `cd <workspaceRoot> && ` to the shell script
itself (shellQuote handles paths with spaces / special chars).
- cli.ts runOpencode: drop `cwd: workspaceRoot` from the bun spawn.
Opencode's existing `--dir <workspaceRoot>` flag changes its own
working directory, so spawn-level cwd was redundant anyway.
- captureGitDiff already uses `git -C <workspaceRoot>` (no cwd).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Plain `git -C <workspaceRoot> diff` only reports changes to tracked files. Every new file the agent creates via the `write` tool stays untracked, so it was silently dropped from the captured patch and the rollout was recorded as `patch_exists=false` / reward 0 even though the agent ran to completion (`reason: "stop"`, bench `exit=0`). In a recent SWE-bench batch (swebench_results_1780596228320_bcd81ba7), 8 of the 21 zero-byte patches were exactly this case — 7/8 had `write` calls to brand-new paths and never staged. Example trajectory: denoland-deno-8408-agentic wrote a 6499-byte std/encoding/csv_stringify.ts from scratch, type-checked, ran a functional test, and stopped. git_patch came back "" because the file was untracked. Fix: mark untracked files as intent-to-add (`git add -AN`) so they appear in `git diff` without being committed. Worktree-style diff is preserved so the SWE-bench evaluator's `git apply` still works. Also pass `--binary` so any binary artifacts the agent produces aren't silently truncated by the textual diff path. The system-prompt addendum already tells the model not to commit or push, which is still correct after this fix — the harness now picks up unstaged new files on its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Some dataset SIFs (notably `swe-bench-ext`, and certain SWE-rebench
variants) copy the repo into the workspace as a flat source tree with
no `.git` directory. Without one:
* runDeepReset's first `git rev-parse` fails, the outer `|| true`
swallows the error, and deep_reset is a silent no-op.
* captureGitDiff runs `git -C <workspace> diff` against a non-repo;
git emits `fatal: not a git repository` to stderr and exits with
empty stdout, so the function returns "".
Net effect: every rollout for those SIFs lands as `patch=0 bytes`
regardless of how many files the agent edited. Recent symptom — in
swebench_results_1781123430109_686771d1 (swe-bench-ext, 37 instances),
every patch was zero bytes despite trajectories with up to 30 successful
`edit` calls on the workspace. `cd /workspace/repo && git status` in
the same SIF returns `fatal: not a git repository`.
Port of nv-OpenHands' run_infer.py:1142-1156 swe-bench-ext baseline:
detect the missing `.git`, init a local repo, snapshot the pristine
tree as `opencode_bench_baseline`, and skip deep_reset (the upstream
base_commit SHA doesn't exist in the fresh repo, so deep_reset would
just fall through to its nuclear pass and noisily no-op anyway).
This composes with 52efa89 (`git add -AN && git diff --binary`) — the
prior commit handled `.git`-present-but-untracked-write-files; this
commit handles `.git`-missing-entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
Falling off the end of main() with no explicit process.exit() left Bun
to drain the event loop on its own, which produced a flaky exit=1 even
on successful runs that wrote a valid output.jsonl with `error=none`.
The residual handles seem to come from the spawned opencode subprocess
(piped stdio that hasn't been explicitly closed) and from opencode's
sqlite migration code — both stay registered long enough that Bun
treats the runtime drain as a non-clean exit.
Symptom in DeepSeek-V4-Flash_opencode_0-0-12711152.out (SWE-rebench-V2
train shard): three instances logged
[bench] wrote .../bench_run/output.jsonl (patch=N bytes, error=none)
with N=2001, 9046, ... — `error=none` means `result.exitCode === 0`,
so opencode itself exited cleanly and the patch was captured. But
apptainer exited 1, gym's runner raised
`RuntimeError("Command failed with return code 1")`, and the rollout
was discarded as a failure even though the bench succeeded.
Fix: explicitly `process.exit(result.exitCode === 0 ? 0 : 1)` after the
[bench] wrote log line. Deterministic, mirrors opencode's exit code,
and doesn't depend on the event-loop drain heuristic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Sugam Devare <sdevare@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue for this PR
Closes #
Type of change
What does this PR do?
Please provide a description of the issue, the changes you made to fix it, and why they work. It is expected that you understand why your changes work and if you do not understand why at least say as much so a maintainer knows how much to value the PR.
If you paste a large clearly AI generated description here your PR may be IGNORED or CLOSED!
How did you verify your code works?
Screenshots / recordings
If this is a UI change, please include a screenshot or recording.
Checklist
If you do not follow this template your PR will be automatically rejected.